Abstract

We give an algorithm for completing an order-$m$ symmetric low-rank tensor from its multilinear
entries in time roughly proportional to the number of tensor entries. We apply our
tensor completion algorithm to the problem of learning mixtures of product distributions over
the hypercube, obtaining new algorithmic results. If the centers of the product distribution are
linearly independent, then we recover distributions with as many as $\Omega(n)$ centers in polynomial
time and sample complexity. In the general case, we recover distributions with as many as
$\tilde{\Omega}(n)$ centers in quasi-polynomial time, answering an open problem of Feldman et al. (SIAM J.
Comp.) for the special case of distributions with incoherent bias vectors.

Our main algorithmic tool is the iterated application of a low-rank matrix completion
algorithm for matrices with adversarially missing entries.
1 Introduction


Suppose we are given sample access to a distribution over the hypercube $\{\pm 1\}^n$, where each sample
$x$ is generated in the following manner: there are $k$ product distributions
$\mathcal{D}_1, \ldots, \mathcal{D}_k$ over $\{\pm 1\}^n$ (the $k$ “centers” of the distribution), and $x$ is
drawn from $\mathcal{D}_i$ with probability $p_i$. This distribution is called a product mixture over the
hypercube.

Given such a distribution, our goal is to recover from samples the parameters of the individual 
product distributions. That is, we would like to estimate the probability $p_i$ of drawing from each
product distribution, and furthermore we would like to estimate the parameters of the product 
distribution itself. This problem has been studied extensively and approached with a variety of 
strategies (see e.g. [FM99, CR08, FOS08]). 

A canonical approach to problems of this type is to empirically estimate the moments of the
distribution, from which it may be possible to calculate the distribution parameters using
linear-algebraic tools (see e.g. [AM05, MR06, FOS08, AGH+14], and many more). For product
distributions over the hypercube, this technique runs into the problem that the square moments are
always 1, and so they provide no information.

The seminal work of Feldman, O’Donnell and Servedio [FOS08] introduces an approach to 
this problem which compensates for the missing higher-order moment information using matrix 
completion. Via a restricted brute-force search, Feldman et al. check all possible square moments, 
resulting in an algorithm that is triply-exponential in the number of distribution centers. Continuing
this line of work, by giving an alternative to the brute-force search, Jain and Oh [JO13] recently
obtained a polynomial-time algorithm for a restricted class of product mixtures. In this paper we 
extend these ideas, giving a polynomial-time algorithm for a wider class of product mixtures, and a 
quasi-polynomial time algorithm for an even broader class of product mixtures (including product 
mixtures with centers which are not linearly independent). 

Our main tool is a matrix-completion-based algorithm for completing tensors of order $m$ from
their multilinear moments in time $n^{O(m)}$, which we believe may be of independent interest.

There has been ample work in the area of noisy tensor decomposition (and completion), see e.g.
[JO14, BKS15, TS15, BM15]. However, these works usually assume that the tensor is obscured
by random noise, while in our setting the “noise” is the absence of all non-multilinear entries. An
exception to this is the work of [BKS15], where to obtain a quasi-polynomial algorithm it suffices
to have the injective tensor norm of the noise be bounded via a Sum-of-Squares proof.^1 To our
knowledge, our algorithm is the only $n^{O(m)}$-time algorithm that solves the problem of completing
a symmetric tensor when only multilinear entries are known.

1.1 Our Results 

Our main result is an algorithm for learning a large subclass of product mixtures with up to even
$\Omega(n)$ centers in polynomial (or quasi-polynomial) time. The subclass of distributions on which
our algorithm succeeds is described by characteristics of the subspace spanned by the bias vectors.
Specifically, the rank and incoherence of the span of the bias vectors cannot simultaneously be too
large. Intuitively, the incoherence of a subspace measures how close the subspace is to a coordinate
subspace of $\mathbb{R}^n$. We give a formal definition of incoherence later, in Definition 2.3.

More formally, we prove the following theorem: 

^1 It may be possible that this condition is met for some symmetric tensors when only multilinear
entries are known, but we do not know an SOS proof of this fact.
Theorem 1.1. Let $\mathcal{D}$ be a mixture over $k$ product distributions on $\{\pm 1\}^n$, with bias vectors
$v_1, \ldots, v_k \in \mathbb{R}^n$ and mixing weights $w_1, \ldots, w_k > 0$. Let $\mathrm{span}\{v_i\}$ have
dimension $r$ and incoherence $\mu$. Suppose we are given as input the moments of $\mathcal{D}$.

1. If $v_1, \ldots, v_k$ are linearly independent, then as long as $4 \cdot \mu \cdot r < n$, there is a
$\mathrm{poly}(n, k)$ algorithm that recovers the parameters of $\mathcal{D}$.

2. Otherwise, if $|\langle v_i, v_j \rangle| < \|v_i\| \cdot \|v_j\| \cdot (1 - \eta)$ for every $i \ne j$ and
$\eta > 0$, then as long as $4 \cdot \mu \cdot r \cdot \log k / \log \frac{1}{1-\eta} < n$, there is an
$n^{\tilde{O}(\log k)}$ time algorithm that recovers the parameters of $\mathcal{D}$ (the tilde hides a
dependence on the separation $\eta$).

Remark 1.2. In the case that $v_1, \ldots, v_k$ are not linearly independent, the runtime depends on the
separation between the vectors. We remark however that if we have some $v_i = v_j$ for $i \ne j$, then
the distribution is equivalently representable with fewer centers by taking the center $v_i$ with mixing
weight $w_i + w_j$. If there is some $v_i = -v_j$, then our algorithm can be modified to work in that case
as well, again by considering $v_i$ and $v_j$ as one center; we detail this in Section 4.

In the main body of the paper we assume access to exact moments; in Appendix B we prove 
Theorem B.2, a version of Theorem 1.1 which accounts for sampling error. 

The foundation of our algorithm for learning product mixtures is an algorithm for completing 
a low-rank incoherent tensor of arbitrary order given access only to its multilinear entries: 

Theorem 1.3. Let $T$ be a symmetric tensor of order $m$, so that
$T = \sum_{i \in [k]} w_i \cdot v_i^{\otimes m}$ for some vectors $v_1, \ldots, v_k \in \mathbb{R}^n$ and scalars
$w_1, \ldots, w_k \ne 0$. Let $\mathrm{span}\{v_i\}$ have incoherence $\mu$ and dimension $r$. Given perfect
access to all multilinear entries of $T$, if $4 \cdot \mu \cdot r \cdot m / n < 1$, then there is an algorithm
which returns the full tensor $T$ in time $\tilde{O}(n^{m+1})$.

1.2 Prior Work 

We now discuss in more detail prior work on learning product mixtures over the hypercube, and 
contextualize our work in terms of previous results. 

The pioneering papers on this question gave algorithms for a very restricted setting: the works 
of [FM99] and [C99, CGG01] introduced the problem and gave algorithms for learning a mixture
of exactly two product distributions over the hypercube. 

The first general result is the work of Feldman, O’Donnell and Servedio, who give an algorithm
for learning a mixture over $k$ product distributions in $n$ dimensions in time $n^{O(k^3)}$ with sample
complexity $n^{O(k)}$. Their algorithm relies on brute-force search to enumerate all possible product
mixtures that are consistent with the observed second moments of the distribution. After this, they
use samples to select the hypothesis with the maximum likelihood. Their paper leaves as an open
question the more efficient learning of discrete mixtures of product distributions, with a smaller
exponential dependence (or even a quasipolynomial dependence) on the number of centers.^2

More recently, Jain and Oh [JO13] extended this approach: rather than generate a large number
of hypotheses and pick one, they use a tensor power iteration method of [AGH+14] to find the right
decomposition of the second- and third-order moment tensors. To learn these moment tensors in 
the first place, they use alternating minimization to complete the (block)-diagonal of the second 
moments matrix, and they compute a least-squares estimation of the third-order moment tensor. 

^2 We do not expect better than quasipolynomial dependence on the number of centers, as learning the
parity distribution on $t$ bits is conjectured to require at least $n^{\Omega(t)}$ time, and this
distribution can be realized as a product mixture over $2^{t-1}$ centers.
Learning Product Mixtures with $k$ Centers over $\{\pm 1\}^n$

| Reference                          | Runtime                   | Samples                   | Largest $k$        | Dep. Centers? | Incoherence? |
|------------------------------------|---------------------------|---------------------------|--------------------|---------------|--------------|
| Feldman et al. [FOS08]             | $n^{O(k^3)}$              | $n^{O(k)}$                | $n$                | Allowed       | Not Required |
| Jain & Oh [JO14]                   | $\mathrm{poly}(n, k)$     | $\mathrm{poly}(n, k)$     | $k \le O(n^{2/7})$ | Not Allowed   | Required     |
| Our Results (lin. indep. centers)  | $\mathrm{poly}(n, k)$     | $\mathrm{poly}(n, k)$     | $k \le O(n)$       | Allowed       | Required     |
| Our Results (lin. dep. centers)    | $n^{\tilde{O}(\log k)}$   | $n^{\tilde{O}(\log k)}$   | $k \le O(n)$       | Allowed       | Required     |

Figure 1: Comparison of our work to previous results. We compare runtime, sample complexity, and
restrictions on the centers of the distribution: the maximum number of centers, whether linearly
dependent centers are allowed, and whether the centers are required to be incoherent. The two
subrows correspond to the cases of linearly independent and linearly dependent centers, for which
we guarantee different sample complexity and runtime.

Using these techniques, Jain and Oh were able to obtain a significant improvement for a restricted
class of product mixtures, obtaining a polynomial time algorithm for linearly independent mixtures
over at most $k = O(n^{2/7})$ centers. In order to ensure the convergence of their matrix (and tensor)
completion subroutine, they introduce constraints on the span of the bias vectors of the distribution
(see Section 2.3 for a discussion of incoherence assumptions on product mixtures). Specifically,
letting $r$ be the rank of the span, letting $\mu$ be the incoherence of the span, and letting $n$ be the
dimension of the samples, they require that a function of $\mu$ and $r$ be smaller than $n$.^3
Furthermore, in order to extract the bias vectors from the moment information, they require that
the bias vectors be linearly independent. When these conditions are met by the product mixture,
Jain and Oh learn the mixture in polynomial time.

In this paper, we improve upon this result, and can handle as many as $\Omega(n)$ centers in some
parameter settings. Similarly to [JO13], we use as a subroutine an algorithm for completing
low-rank matrices with adversarially missing entries. However, unlike [JO13], we use an algorithm
with more general guarantees, the algorithm of [HKZ11].^4 These stronger guarantees allow us to devise
an algorithm for completing low-rank higher-order tensors from their multilinear entries, and this
algorithm allows us to obtain a polynomial time algorithm for a more general class of linearly
independent mixtures of product distributions than [JO13].

Furthermore, because of the more general nature of this matrix completion algorithm, we can 
give a new algorithm for completing low-rank tensors of arbitrary order given access only to the 
multilinear entries of the tensor. Leveraging our multilinear tensor completion algorithm, we can 
reduce the case of linearly dependent bias vectors to the linearly independent case by going to
higher-dimensional tensors. This allows us to give a quasipolynomial algorithm for the general case,
in which the centers may be linearly dependent. To our knowledge, Theorem 1.1 is the first
quasi-polynomial algorithm that learns product mixtures whose centers are not linearly independent.

Restrictions on Input Distributions. We detail our restrictions on the input distribution. In
the linearly independent case, if there are $k$ bias vectors, $\mu$ is the incoherence of their span,
and $n$ is the dimension of the samples, then we learn a product mixture in time $\mathrm{poly}(n, k)$ so
long as

^3 The conditions are actually more complicated, depending on the condition number of the
second-moment matrix of the distribution. For precise conditions, see [JO13].

^4 A previous version of this paper included an analysis of a matrix completion algorithm almost
identical to that of [HKZ11], and claimed to be the first adversarial matrix completion result of this
generality. Thanks to the comments of an anonymous reviewer, we were notified of our mistake.
$4 \mu r < n$. Compare this to the restriction of Jain and Oh: we are able to handle even a linear
number of centers so long as the incoherence is not too large, while Jain and Oh can handle at most
$O(n^{2/7})$ centers. If the $k$ bias vectors are not independent, but their span has rank $r$ and if
they have maximum pairwise inner product $1 - \eta$ (when scaled to unit vectors), then we learn the
product mixture in time $n^{\tilde{O}(\log k)}$ so long as $4 \mu r \log k / \log \frac{1}{1-\eta} < n$ (we also
require a quasipolynomial number of samples in this case).

While the quasipolynomial runtime for linearly dependent vectors may not seem particularly
glamorous, we stress that the runtime depends on the separation between the vectors. To illustrate
the additional power of our result, we note that a choice of random $v_1, \ldots, v_k$ in an
$r$-dimensional subspace meets this condition extremely well, as we have
$\eta = 1 - \tilde{O}(1/\sqrt{r})$ with high probability. For, say, $k = 2r$, the algorithm of [JO13] would
fail in this case, since $v_1, \ldots, v_k$ are not linearly independent, but our algorithm succeeds.

This quasipolynomial time algorithm resolves an open problem of [FOS08], when restricted to
distributions whose bias vectors satisfy our condition on their rank and incoherence. We do not solve
the problem in full generality; for example, our algorithm fails to work when the distribution can
have multiple decompositions into few centers. In such situations, the centers do not span an
incoherent subspace, and thus the completion algorithms we apply fail to work. In general, the
completion algorithms fail whenever the moment tensors admit many different low-rank decompositions
(which can happen even when the decomposition into centers is unique, for example parity on three
bits). In this case, the best algorithm we know of is the restricted brute force of Feldman,
O’Donnell and Servedio.

Sample Complexity. One note about sample complexity: in the linearly dependent case, we
require a quasipolynomial number of samples to learn our product mixture. That is, if there are $k$
product centers, we require $n^{\tilde{O}(\log k)}$ samples, where the tilde hides a dependence on the
separation between the centers. In contrast, Feldman, O’Donnell, and Servedio require $n^{O(k)}$
samples. This dependence on $k$ in the sample complexity is not explicitly given in their paper, as
for their algorithm to be practical they consider only constant $k$.

Parameter Recovery Using Tensor Decomposition. The strategy of employing the spectral 
decomposition of a tensor in order to learn the parameters of an algorithm is not new, and has 
indeed been employed successfully in a number of settings. In addition to the papers already 
mentioned which use this approach for learning product mixtures ([JO14] and in some sense [FOS08],
though the latter uses matrices rather than tensors), the works of [MR06, AHK12, HK13, AGHK14, 
BCMV14], and many more also use this idea. In our paper, we extend this strategy to learn a more 
general class of product distributions over the hypercube than could previously be tractably learned. 

1.3 Organization 

The remainder of our paper is organized as follows. In Section 2, we give definitions and
background, then outline our approach to learning product mixtures over the hypercube, as well as
put forth a short discussion on what kinds of restrictions we place on the bias vectors of the
distribution. In Section 3, we give an algorithm for completing symmetric tensors given access only
to their multilinear entries, using adversarial matrix completion as an algorithmic primitive. In
Section 4, we apply our tensor completion result to learn mixtures of product distributions over
the hypercube, assuming access to the precise second- and third-order moments of the
distribution. Appendix A and Appendix B contain discussions of matrix completion and learning product
mixtures in the presence of sampling error, and Appendix C contains further details about the
algorithmic primitives used in learning product mixtures.

1.4 Notation 

We use $e_i$ to denote the $i$th standard basis vector.

For a tensor $T \in \mathbb{R}^{n \times n \times n}$, we use $T(a, b, c)$ to denote the entry of the tensor
indexed by $a, b, c \in [n]$, and we use $T(i, \cdot, \cdot)$ to denote the $i$th slice of the tensor, or
the subset of entries in which the first coordinate is fixed to $i \in [n]$. For an order-$m$ tensor
$T \in \mathbb{R}^{n^m}$, we use $T(X)$ to represent the entry indexed by the string $X \in [n]^m$, and we use
$T(Y, \cdot, \cdot)$ to denote the slice of $T$ indexed by the string $Y \in [n]^{m-2}$. For a vector
$x \in \mathbb{R}^n$, we use the shorthand $x^{\otimes k}$ to denote the $k$-tensor
$x \otimes x \otimes \cdots \otimes x \in \mathbb{R}^{n^k}$.

We use $\Omega \subseteq [m] \times [n]$ for the set of observed entries of the hidden matrix $M$, and
$\mathcal{P}_\Omega$ denotes the projection onto those coordinates.

2 Preliminaries 

In this section we present background necessary to prove our results, as well as provide a short 
discussion on the meaning behind the restrictions we place on the distributions we can learn. We 
start by defining our main problem. 


2.1 Learning Product Mixtures over the Hypercube 


A distribution $\mathcal{D}$ over $\{\pm 1\}^n$ is called a product distribution if every bit in a sample
$x \sim \mathcal{D}$ is independently chosen. Let $\mathcal{D}_1, \ldots, \mathcal{D}_k$ be a set of product
distributions over $\{\pm 1\}^n$. Associate with each $\mathcal{D}_i$ a vector $v_i \in [-1, 1]^n$ whose
$j$th entry encodes the bias of the $j$th coordinate, that is

$$\Pr_{x \sim \mathcal{D}_i}[x(j) = 1] = \frac{1 + v_i(j)}{2}.$$

Define the distribution $\mathcal{D}$ to be a convex combination of these product distributions, sampling
$x \sim \mathcal{D} = \{x \sim \mathcal{D}_i \text{ with probability } w_i\}$, where $w_i > 0$ and
$\sum_{i \in [k]} w_i = 1$. The distributions $\mathcal{D}_1, \ldots, \mathcal{D}_k$ are said to be the centers
of $\mathcal{D}$, the vectors $v_1, \ldots, v_k$ are said to be the bias vectors, and $w_1, \ldots, w_k$ are
said to be the mixing weights of the distribution.
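The sampling process just defined can be sketched in a few lines. This is only an illustration: the function name `sample_product_mixture` and the toy bias vectors and weights are assumptions made for the example, not part of the paper.

```python
import random

def sample_product_mixture(biases, weights, rng):
    """Draw one sample x in {-1, +1}^n from a product mixture.

    biases[i][j] plays the role of v_i(j) in [-1, 1];
    weights[i] plays the role of the mixing weight w_i (summing to 1).
    """
    # Pick a center i with probability w_i.
    (center,) = rng.choices(range(len(weights)), weights=weights, k=1)
    v = biases[center]
    # Each coordinate is independently +1 with probability (1 + v(j)) / 2.
    return [1 if rng.random() < (1 + vj) / 2 else -1 for vj in v]

# Hypothetical toy mixture with k = 2 centers in n = 3 dimensions.
rng = random.Random(0)
biases = [[1.0, -1.0, 0.5], [-1.0, 1.0, 0.0]]
weights = [0.25, 0.75]
samples = [sample_product_mixture(biases, weights, rng) for _ in range(1000)]
```

In this toy instance the first two coordinates have biases of absolute value 1, so they are deterministic given the center, while the third coordinate is genuinely random.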


Problem 2.1 (Learning a Product Mixture over the Hypercube). Given independent samples from
a distribution $\mathcal{D}$ which is a mixture over $k$ centers with bias vectors
$v_1, \ldots, v_k \in [-1, 1]^n$ and mixing weights $w_1, \ldots, w_k > 0$, recover $v_1, \ldots, v_k$ and
$w_1, \ldots, w_k$.


This framework encodes many subproblems, including learning parities, a notorious problem in 
learning theory; the best current algorithm requires time and the noisy version of this problem 

is a standard cryptographic primitive [MOS04, Fel07, Reg09, Val15]. We do not expect to be able
to learn an arbitrary mixture of product distributions efficiently. We obtain a polynomial-time
algorithm when the bias vectors are linearly independent, and a quasi-polynomial time algorithm 
in the general case, though we do require an incoherence assumption on the bias vectors (which 
parities do not meet), see Definition 2.3. 

In [FOS08], the authors give an $n^{O(k^3)}$-time algorithm for the problem based on the following
idea. With great accuracy in polynomial time we may compute the pairwise moments of $\mathcal{D}$,

$$M = \mathbb{E}_{x \sim \mathcal{D}}[x x^\top] = E_2 + \sum_{i \in [k]} w_i \cdot v_i v_i^\top.$$

The matrix $E_2$ is a diagonal matrix which corrects for the fact that $M_{jj} = 1$ always. If we were
able to learn $E_2$ and thus access $\sum_{i \in [k]} w_i \cdot v_i v_i^\top$, the “augmented second moment
matrix,” we may hope to use spectral information to learn $v_1, \ldots, v_k$.
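To make the role of the diagonal correction concrete, the following sketch computes the second-moment matrix of a small mixture exactly, by enumerating the hypercube, and compares it with the augmented second moment matrix. The helper names and the toy parameters are assumptions chosen for illustration; exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction as F
from itertools import product

def prob(x, biases, weights):
    """Probability of the point x in {-1, +1}^n under the mixture."""
    total = F(0)
    for v, w in zip(biases, weights):
        p = w
        for xj, vj in zip(x, v):
            p *= (1 + xj * vj) / 2  # Pr[x(j) = 1] = (1 + v(j)) / 2
        total += p
    return total

def second_moment(biases, weights):
    """M = E[x x^T], computed by brute force over all 2^n points."""
    n = len(biases[0])
    M = [[F(0)] * n for _ in range(n)]
    for x in product((-1, 1), repeat=n):
        p = prob(x, biases, weights)
        for a in range(n):
            for b in range(n):
                M[a][b] += p * x[a] * x[b]
    return M

def augmented_second_moment(biases, weights):
    """sum_i w_i * v_i v_i^T, including its (unobserved) diagonal."""
    n = len(biases[0])
    return [[sum(w * v[a] * v[b] for v, w in zip(biases, weights))
             for b in range(n)] for a in range(n)]

# Hypothetical toy mixture: k = 2 centers, n = 3 coordinates.
biases = [[F(1, 2), F(-1, 2), F(0)], [F(1, 4), F(1, 2), F(1)]]
weights = [F(1, 3), F(2, 3)]
M = second_moment(biases, weights)
A = augmented_second_moment(biases, weights)
E2 = [[M[a][b] - A[a][b] for b in range(3)] for a in range(3)]
```

Here `E2` is diagonal, as the text states: the off-diagonal entries of `M` and `A` agree, while every diagonal entry of `M` equals 1 regardless of the bias vectors.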

The algorithm of [FOS08] performs a brute-force search to learn $E_2$, leading to a runtime
exponential in the rank. By making additional assumptions on the input $\mathcal{D}$ and computing
higher-order moments as well, we avoid this brute force search and give a polynomial-time algorithm
for product distributions with linearly independent centers: if the bias vectors are linearly
independent, a power iteration algorithm of [AGH+14] allows us to learn $\mathcal{D}$ given access to both
the augmented second- and third-order moments.^5 Again, sampling the third-order moments only gives
access to $\mathbb{E}_{x \sim \mathcal{D}}[x^{\otimes 3}] = E_3 + \sum_{i \in [k]} w_i \cdot v_i^{\otimes 3}$, where
$E_3$ is a tensor which is nonzero only on entries of multiplicity at least two. To learn $E_2$ and
$E_3$, Jain and Oh used alternating minimization and a least-squares approximation. For our
improvement, we develop a tensor completion algorithm based on recursively applying the adversarial
matrix completion algorithm of Hsu, Kakade and Zhang [HKZ11]. In order to apply these completion
algorithms, we require an incoherence assumption on the bias vectors (which we define in the next
section).

In the general case, when the bias vectors are not linearly independent, we exploit the fact that
high-enough tensor powers of the bias vectors are independent, and we work with the
$\tilde{O}(\log k)$th moments of $\mathcal{D}$, applying our tensor completion to learn the full moment
tensor, and then using [AGH+14] to find the tensor powers of the bias vectors, from which we can
easily recover the vectors themselves (the tilde hides a dependence on the separation between the
bias vectors). Thus if the distribution is assumed to come from bias vectors that are incoherent and
separated, then we can obtain a significant runtime improvement over [FOS08].
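The fact that tensor powers of linearly dependent vectors can become independent is easy to check on a tiny instance. The sketch below is an illustration only: the three vectors, the helper names `tensor_power` and `rank`, and the use of exact Gaussian elimination are all choices made for this example.

```python
from fractions import Fraction as F
from itertools import product

def tensor_power(v, m):
    """Flattened m-fold tensor power of v: one entry per index string in [n]^m."""
    out = []
    for idx in product(range(len(v)), repeat=m):
        entry = F(1)
        for a in idx:
            entry *= v[a]
        out.append(entry)
    return out

def rank(rows):
    """Rank of a list of row vectors via exact Gaussian elimination."""
    rows = [list(map(F, r)) for r in rows]
    rk = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                factor = rows[i][col] / rows[rk][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk

# Three vectors in R^2 are necessarily linearly dependent ...
vs = [[1, 0], [0, 1], [1, 1]]
rank_vectors = rank(vs)
# ... but their second tensor powers, viewed as vectors in R^4, are not.
rank_squares = rank([tensor_power(v, 2) for v in vs])
```

In this example the three vectors span a 2-dimensional space, while their tensor squares span a 3-dimensional one, so a decomposition method that needs independent components can be applied at order 2.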

2.2 Matrix Completion and Incoherence 

As discussed above, the matrix (and tensor) completion problem arises naturally in learning product 
mixtures as a way to compute the augmented moment tensors. 

Problem 2.2 (Matrix Completion). Given a set $\Omega \subseteq [m] \times [n]$ of observed entries of a
hidden rank-$r$ matrix $M$, the Matrix Completion Problem is to successfully recover the matrix $M$
given only $\mathcal{P}_\Omega(M)$.

However, this problem is not always well-posed. For example, consider the input matrix
$M = e_1 e_1^\top + e_n e_n^\top$. $M$ is rank-2, and has only 2 nonzero entries on the diagonal, and zeros
elsewhere.
Even if we observe almost the entire matrix (and even if the observed indices are random), it is 
likely that every entry we see will be zero, and so we cannot hope to recover M. Because of this, 
it is standard to ask for the input matrix to be incoherent: 

Definition 2.3. Let $U \subseteq \mathbb{R}^n$ be a subspace of dimension $r$. We say that $U$ is incoherent
with parameter $\mu$ if $\max_{j \in [n]} \|\mathrm{proj}_U(e_j)\|^2 \le \mu \frac{r}{n}$. If $M$ is a matrix
with left and right singular spaces $U$ and $V$, we say that $M$ is $(\mu_U, \mu_V)$-incoherent if $U$
(resp. $V$) is incoherent with parameter $\mu_U$ (resp. $\mu_V$). We say that $v_1, \ldots, v_k$ are
incoherent with parameter $\mu$ if their span is incoherent with parameter $\mu$.
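As a small illustration of this definition, the sketch below computes the smallest incoherence parameter of the span of a few vectors, assuming the normalization $\max_j \|\mathrm{proj}_U(e_j)\|^2 \le \mu r / n$ (the standard one in the matrix completion literature). The function names are hypothetical, and Gram-Schmidt in floating point is used only because the examples are tiny.

```python
import math

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(map(float, v))
        for q in basis:
            c = sum(a * b for a, b in zip(w, q))
            w = [a - c * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > tol:  # skip vectors already in the span
            basis.append([a / norm for a in w])
    return basis

def incoherence(vectors):
    """Smallest mu with max_j ||proj_U(e_j)||^2 <= mu * r / n, for U = span(vectors)."""
    Q = orthonormal_basis(vectors)
    n, r = len(vectors[0]), len(Q)
    # ||proj_U(e_j)||^2 is the sum of squared j-th coordinates of the basis vectors.
    worst = max(sum(q[j] ** 2 for q in Q) for j in range(n))
    return worst * n / r

mu_spread = incoherence([[1, 1, 1, 1], [1, -1, 1, -1]])  # well-spread vectors
mu_coord = incoherence([[1, 0, 0, 0]])                   # a coordinate subspace
```

The well-spread example attains the smallest possible value, 1, while the coordinate subspace attains the largest possible value for its dimension, $n/r = 4$, matching the intuition that incoherence measures closeness to a coordinate subspace.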

Incoherence means that the singular vectors are well-spread over their coordinates. Intuitively, 
this asks that every revealed entry actually gives information about the matrix. For a discussion on 

^5 There are actually several algorithms in this space; we use the tensor-power iteration of [AGH+14]
specifically. There is a rich body of work on tensor decomposition methods, based on simultaneous
diagonalization and similar techniques (see e.g. Jennrich’s algorithm [Har70] and [LCC07]).
what kinds of matrices are incoherent, see e.g. [CR09]. Once the underlying matrix is assumed to be
incoherent, there are a number of possible algorithms one can apply to try and learn the remaining
entries of $M$. Much of the prior work on matrix completion has been focused on achieving recovery
when the revealed entries are randomly distributed, and the goal is to minimize the number of
samples needed (see e.g. [CR09, Rec09, GAGG13, Har14]). For our application, the revealed
entries are not randomly distributed, but we have access to almost all of the entries ($\Omega(n^2)$
entries as opposed to the $\Omega(nr \log n)$ entries needed in the random case). Thus we use a
particular kind of matrix completion theorem we call “adversarial matrix completion,” which can be
achieved directly from the work of Hsu, Kakade and Zhang [HKZ11]:

Theorem 2.4. Let $M$ be an $m \times n$ rank-$r$ matrix which is $(\mu_U, \mu_V)$-incoherent, and let
$\bar{\Omega} \subseteq [m] \times [n]$ be the set of hidden indices. If there are at most $\kappa$ elements
per column and $\rho$ elements per row of $\bar{\Omega}$, and
$2\left(\kappa \frac{\mu_U}{m} + \rho \frac{\mu_V}{n}\right) r < 1$, then there is an algorithm that recovers
$M$.
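As a sanity check on how this condition is used later, the following worked instance specializes it to a square matrix whose hidden entries are exactly the diagonal. It assumes the hypothesis of Theorem 2.4 reads $2(\kappa \mu_U / m + \rho \mu_V / n)\, r < 1$, with $\kappa$ and $\rho$ the numbers of hidden entries per column and per row; the specialization itself is only an illustration.

```latex
% n x n matrix (m = n) with only the diagonal hidden:
% kappa = rho = 1 and mu_U = mu_V = mu.
2\left(\kappa\,\frac{\mu_U}{m} + \rho\,\frac{\mu_V}{n}\right) r
  \;=\; 2\left(\frac{\mu}{n} + \frac{\mu}{n}\right) r
  \;=\; \frac{4\mu r}{n} \;<\; 1
  \quad\Longleftrightarrow\quad 4\mu r < n .
```

Under this reading, completing a matrix with a missing diagonal requires exactly the condition $4 \mu r < n$ that appears in the linearly independent case of Theorem 1.1.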

For the application of learning product mixtures, note that the moment tensors are incoherent 
exactly when the bias vectors are incoherent. In Section 3 we show how to apply Theorem 2.4 
recursively to perform a special type of adversarial tensor completion, which we use to recover the 
augmented moment tensors of $\mathcal{D}$ after sampling.

Further, we note that Theorem 2.4 is almost tight. That is, there exist matrix completion
instances with $\kappa/n = 1 - o(1)$, $\mu = 1$ and $r = 3$ for which finding any completion is NP-hard
[HMRW14, Pee96] (via a reduction from three-coloring), so the constant on the right-hand side
is necessarily at most six. We also note that the tradeoff between $\kappa/n$ and $\mu$ in Theorem 2.4 is
necessary because for a matrix of fixed rank, one can add extra rows and columns of zeros in an
attempt to reduce $\kappa/n$, but this process increases $\mu$ by an identical factor. This suggests that
improving Theorem 1.1 by obtaining a better efficient adversarial matrix completion algorithm is
not likely.

2.3 Incoherence and Decomposition Uniqueness 

In order to apply our completion techniques, we place the restriction of incoherence on the subspace 
spanned by the bias vectors. At first glance this may seem like a strange condition which is 
unnatural for probability distributions, but we try to motivate it here. When the bias vectors 
are incoherent and separated enough, even high-order moment-completion problems have unique
solutions, and moreover that solution is equal to $\sum_{i \in [k]} w_i \cdot v_i^{\otimes m}$. In particular,
this implies that the

distribution must have a unique decomposition into a minimal number of well-separated centers 
(otherwise those different decompositions would produce different minimum-rank solutions to a 
moment-completion problem for high-enough order moments). Thus incoherence can be thought of 
as a special strengthening of the promise that the distribution has a unique minimal decomposition. 
Note that there are distributions which have a unique minimal decomposition but are not incoherent, 
such as a parity on any number of bits. 

3 Symmetric Tensor Completion from Multilinear Entries 

In this section we use adversarial matrix completion as a primitive to give a completion algorithm
for symmetric tensors when only a special kind of entry in the tensor is known. Specifically, we call
a string $X \in [n]^m$ multilinear if every element of $X$ is distinct, and we will show how to complete
a symmetric tensor $T \in \mathbb{R}^{n^m}$ when only given access to its multilinear entries, i.e. $T(X)$ is
known if $X$ is multilinear. In the next section, we will apply our tensor completion algorithm to
learn mixtures of product distributions over the boolean hypercube.

Our approach is a simple recursion: we complete the tensor slice-by-slice, using the entries we 
learn from completing one slice to provide us with enough known entries to complete the next. The 
following definition will be useful in precisely describing our recursive strategy: 

Definition 3.1. Define the histogram of a string $X \in [n]^m$ to be the multiset containing the number
of repetitions of each character making at least one appearance in $X$.

For example, the string (1,1,2, 3) and the string (4,4, 5,6) both have the histogram (2,1,1). 
Note that the entries of the histogram of a string of length m always sum to m, and that the length 
of the histogram is the number of distinct symbols in the string. 
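The two notions used in this section, multilinear strings and histograms, are simple to compute. The sketch below is illustrative: the function names are hypothetical, and the histogram multiset is represented as a tuple sorted in decreasing order so that equal multisets compare equal.

```python
from collections import Counter

def histogram(X):
    """Multiset of repetition counts of the symbols of X, as a sorted tuple."""
    return tuple(sorted(Counter(X).values(), reverse=True))

def is_multilinear(X):
    """A string is multilinear if all of its symbols are distinct."""
    return len(set(X)) == len(X)

h1 = histogram((1, 1, 2, 3))
h2 = histogram((4, 4, 5, 6))
```

Both example strings from the text yield the histogram `(2, 1, 1)`, and a string of length $m$ is multilinear exactly when its histogram has length $m$.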

Having defined a histogram, we are now ready to describe our tensor completion algorithm.

Algorithm 3.2 (Symmetric Tensor Completion from Multilinear Moments). Input: The
multilinear entries of the tensor $T = \sum_{i \in [k]} w_i \cdot v_i^{\otimes m} + E$, for vectors
$v_1, \ldots, v_k \in \mathbb{R}^n$ and scalars $w_1, \ldots, w_k > 0$ and some error tensor $E$. Goal: Recover
the symmetric tensor $T^* = \sum_{i \in [k]} w_i \cdot v_i^{\otimes m}$.

1. Initialize the tensor $\hat{T}$ with the known multilinear entries of $T$.

2. For each subset $Y \in [n]^{m-2}$ with no repetitions:

• Let $\hat{T}(Y, \cdot, \cdot) \in \mathbb{R}^{n \times n}$ be the tensor slice indexed by $Y$.

• Remove the rows and columns of $\hat{T}(Y, \cdot, \cdot)$ corresponding to indices present in
$Y$. Complete the matrix using the algorithm of [HKZ11] from Theorem 2.4
and add the learned entries to $\hat{T}$.

3. For $\ell = m - 2, \ldots, 1$:

(a) For each $X \in [n]^m$ with a histogram of length $\ell$, if $\hat{T}(X)$ is empty:

• If there is an element $x_i$ appearing at least 3 times, let $Y = X \setminus \{x_i, x_i\}$.

• Else there are elements $x_i, x_j$ each appearing twice, let $Y = X \setminus \{x_i, x_j\}$.

• Let $\hat{T}(Y, \cdot, \cdot) \in \mathbb{R}^{n \times n}$ be the tensor slice indexed by $Y$.

• Complete the matrix $\hat{T}(Y, \cdot, \cdot)$ using the algorithm from Theorem 2.4 and
add the learned entries to $\hat{T}$.

4. Symmetrize $\hat{T}$ by taking each entry to be the average over entries indexed by the
same subset.

Output: $\hat{T}$.
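The only combinatorial choice in step 3 is which slice to complete for a given string $X$. The following sketch shows one way to make that choice, assuming step 3 is read as "remove two repeated symbols from $X$ so that the set of distinct symbols is unchanged"; the function name `slice_index` is hypothetical, and the matrix completion primitive itself is not implemented here.

```python
from collections import Counter

def slice_index(X):
    """Choose the slice index Y for a string X whose histogram has length <= len(X) - 2.

    Returns a tuple of length len(X) - 2 with the same set of distinct
    symbols as X, or None if X does not have enough repetitions
    (such strings are handled by step 2 or are already known).
    """
    counts = Counter(X)
    triples = [s for s, c in counts.items() if c >= 3]
    doubles = [s for s, c in counts.items() if c == 2]
    if triples:
        to_remove = [triples[0], triples[0]]   # Y = X \ {x_i, x_i}
    elif len(doubles) >= 2:
        to_remove = doubles[:2]                # Y = X \ {x_i, x_j}
    else:
        return None
    Y = list(X)
    for s in to_remove:
        Y.remove(s)  # removes one occurrence of s
    return tuple(Y)

Y_triple = slice_index((1, 1, 1, 2))
Y_pairs = slice_index((1, 1, 2, 2))
```

In both examples the resulting $Y$ still contains every distinct symbol of $X$, which is the property the correctness proof relies on: the unknown entries of the slice indexed by $Y$ are confined to the principal submatrix on the symbols of $Y$.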


Observation 3.3. One might ask why we go through the effort of completing the tensor slice-by-slice,
rather than simply flattening it to an $n^{m/2} \times n^{m/2}$ matrix and completing that. The reason
is that when $\mathrm{span}\{v_1, \ldots, v_k\}$ has incoherence $\mu$ and dimension $r$,
$\mathrm{span}\{v_1^{\otimes m/2}, \ldots, v_k^{\otimes m/2}\}$ may have incoherence as large as
$(\mu r)^{m/2} / k$, which drastically reduces the range of parameters for which recovery
is possible (for example, if $k = O(r)$ then we would need $r < n^{2/m}$). Working slice-by-slice keeps
the incoherence of the input matrices small, allowing us to complete even up to rank $r = \Omega(n)$.

Theorem 3.4. Let $T$ be a symmetric tensor of order $m$, so that
$T = \sum_{i \in [k]} w_i \cdot v_i^{\otimes m}$ for some vectors $v_1, \ldots, v_k \in \mathbb{R}^n$ and scalars
$w_1, \ldots, w_k \ne 0$. Let $\mathrm{span}\{v_i\}$ have incoherence $\mu$ and dimension $r$. Given perfect
access to all multilinear entries of $T$ (i.e. $E = 0$), if $4 \cdot \mu \cdot r \cdot m / n < 1$, then
Algorithm 3.2 returns the full tensor $T$ in time $\tilde{O}(n^{m+1})$.



In Appendix B, we give a version of Theorem 3.4 that accounts for error E in the input. 

Proof. We prove that Algorithm 3.2 successfully completes all the entries of $T$ by induction on the
length of the histograms of the entries. By assumption, we are given as input every entry with a
histogram of length $m$. For an entry $X$ with a histogram of length $m - 1$, exactly one of its
elements has multiplicity two, call it $x_i$, and consider the set $Y = X \setminus \{x_i, x_i\}$. When
step 2 reaches $Y$, the algorithm attempts to complete a matrix revealed from
$T(Y, \cdot, \cdot) = \mathcal{P}_Y\left(\sum_{i \in [k]} w_i \cdot v_i(Y) \cdot v_i v_i^\top\right)$,
where $v_i(Y) = \prod_{j \in Y} v_i(j)$, and $\mathcal{P}_Y$ is the projector to the matrix with the rows and
columns corresponding to indices appearing in $Y$ removed. Exactly the diagonal of
$\hat{T}(Y, \cdot, \cdot)$ is missing since all other entries are multilinear moments, and the
$(x_i, x_i)$th entry should be $T(X)$. Because the rank of this matrix is equal to
$\dim(\mathrm{span}(v_i)) = r$ and $4 \mu r / n \le 4 \mu r m / n < 1$, by Theorem 2.4, we can
successfully recover the diagonal, including $T(X)$. Thus by the end of step 2, $\hat{T}$ contains every
entry with a histogram of length $\ell \ge m - 1$.

For the inductive step, we prove that each time step 3 completes an iteration, $\hat{T}$ contains every
entry with a histogram of length at least $\ell$. Let $X$ be an entry with a histogram of length $\ell$.
When step 3 reaches $X$ in the $\ell$th iteration, if $\hat{T}$ does not already contain $T(X)$, the
algorithm attempts to complete a matrix with entries revealed from
$T(Y, \cdot, \cdot) = \sum_{i \in [k]} w_i \cdot v_i(Y) \cdot v_i v_i^\top$, where $Y$ is a
substring of $X$ with a histogram of the same length. Since $Y$ has a histogram of length $\ell$, every
entry of $\hat{T}(Y, \cdot, \cdot)$ corresponds to an entry with a histogram of length at least
$\ell + 1$, except for the $\ell \times \ell$ principal submatrix whose rows and columns correspond to
elements in $Y$. Thus by the inductive hypothesis, $\hat{T}(Y, \cdot, \cdot)$ is only missing the
aforementioned submatrix, and since $4 \mu r \ell / n \le 4 \mu r m / n < 1$, by Theorem 2.4, we can
successfully recover this submatrix, including $T(X)$. Once all of the entries of $\hat{T}$ are filled
in, the algorithm terminates.

Finally, we note that the runtime is $\tilde{O}(n^{m+1})$, because the algorithm from Theorem 2.4 runs
in time $\tilde{O}(n^3)$, and we perform at most $n^{m-2}$ matrix completions because there are
$n^{m-2}$ strings of length $m - 2$ over the alphabet $[n]$, and we perform at most one matrix
completion for each such string. □

4 Learning Product Mixtures over the Hypercube 


In this section, we apply our symmetric tensor completion algorithm (Algorithm 3.2) to learning 
mixtures of product distributions over the hypercube, proving Theorem 1.1. Throughout this 
section we will assume exact access to moments of our input distribution, deferring finite-sample 
error analysis to Appendix B. We begin by introducing convenient notation. 

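Let $\mathcal{D}$ be a mixture over $k$ centers with bias vectors $v_1,\dots,v_k \in [-1,1]^n$ and mixing weights $w_1,\dots,w_k > 0$. Define $\mathcal{M}_m^{\mathcal{D}} \in \mathbb{R}^{n^m}$ to be the tensor of order-$m$ moments of the distribution $\mathcal{D}$, so that $\mathcal{M}_m^{\mathcal{D}} = \mathbb{E}_{x \sim \mathcal{D}}[x^{\otimes m}]$. Define $\mathcal{T}_m^{\mathcal{D}} \in \mathbb{R}^{n^m}$ to be the symmetric tensor given by the weighted bias vectors of the distribution, so that $\mathcal{T}_m^{\mathcal{D}} = \sum_{i\in[k]} w_i \cdot v_i^{\otimes m}$.

Note that $\mathcal{T}_m^{\mathcal{D}}$ and $\mathcal{M}_m^{\mathcal{D}}$ are equal on their multilinear entries, and not necessarily equal elsewhere. For example, when $m$ is even, entries of $\mathcal{M}_m^{\mathcal{D}}$ indexed by a single repeating character (the "diagonal") are always equal to 1. Also observe that if one can sample from distribution $\mathcal{D}$, then estimating $\mathcal{M}_m^{\mathcal{D}}$ is easy.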
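As a sanity check on this observation, the following Python sketch computes exact moments of a tiny product mixture by enumerating the hypercube and compares them with the corresponding entries of the weighted bias tensor. The function names and the particular weights and biases are hypothetical, and the enumeration is only feasible for very small $n$.

```python
from itertools import product


def mixture_moment(weights, biases, index):
    """Exact E[prod_{j in index} x_j] for a product mixture over {+1,-1}^n."""
    n = len(biases[0])
    total = 0.0
    for x in product((1, -1), repeat=n):
        # Probability of x: sum over centers of w_i * prod_j Pr[x_j | center i],
        # where a coordinate with bias v_j equals s with probability (1 + s*v_j)/2.
        p = 0.0
        for w, v in zip(weights, biases):
            px = w
            for xj, vj in zip(x, v):
                px *= (1 + xj * vj) / 2
            p += px
        term = p
        for j in index:
            term *= x[j]
        total += term
    return total


def bias_tensor_entry(weights, biases, index):
    """Entry of sum_i w_i * v_i^{tensor m} at the given index."""
    total = 0.0
    for w, v in zip(weights, biases):
        term = w
        for j in index:
            term *= v[j]
        total += term
    return total


weights = [0.3, 0.7]
biases = [[0.5, -0.2, 0.1], [-0.4, 0.6, 0.9]]
for index in [(0, 1), (0, 1, 2), (0, 0)]:
    print(index, mixture_moment(weights, biases, index),
          bias_tensor_entry(weights, biases, index))
```

The two quantities agree on the multilinear indices `(0, 1)` and `(0, 1, 2)`, while on the diagonal index `(0, 0)` the moment is 1 and the bias tensor entry is not.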

Suppose that the bias vectors of $\mathcal{D}$ are linearly independent. Then by Theorem 4.1 (due to [AGH+14], with similar statements appearing in [AHK12, HK13, AGHK14]), there is a spectral algorithm which learns $\mathcal{D}$ given $\mathcal{T}_2^{\mathcal{D}}$ and $\mathcal{T}_3^{\mathcal{D}}$ (we give an account of the algorithm in Appendix C).

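Theorem 4.1 (Consequence of Theorem 4.3 and Lemma 5.1 in [AGH+14]). Let $\mathcal{D}$ be a mixture over $k$ centers with bias vectors $v_1,\dots,v_k \in [-1,1]^n$ and mixing weights $w_1,\dots,w_k > 0$. Suppose we are given access to $\mathcal{T}_2^{\mathcal{D}} = \sum_{i\in[k]} w_i \cdot v_i v_i^\top$ and $\mathcal{T}_3^{\mathcal{D}} = \sum_{i\in[k]} w_i \cdot v_i^{\otimes 3}$. Then there is an algorithm which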

recovers the bias vectors and mixing weights ofD within e in time 0{n^ + • (log log 

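Because $\mathcal{T}_2^{\mathcal{D}}$ and $\mathcal{T}_3^{\mathcal{D}}$ are equal to $\mathcal{M}_2^{\mathcal{D}}$ and $\mathcal{M}_3^{\mathcal{D}}$ on their multilinear entries, the tensor completion algorithm of the previous section allows us to find $\mathcal{T}_2^{\mathcal{D}}$ and $\mathcal{T}_3^{\mathcal{D}}$ from $\mathcal{M}_2^{\mathcal{D}}$ and $\mathcal{M}_3^{\mathcal{D}}$ (this is only possible because $\mathcal{T}_2^{\mathcal{D}}$ and $\mathcal{T}_3^{\mathcal{D}}$ are low-rank, whereas $\mathcal{M}_2^{\mathcal{D}}$ and $\mathcal{M}_3^{\mathcal{D}}$ are high-rank). We then learn $\mathcal{D}$ by applying Theorem 4.1.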

A complication is that Theorem 4.1 only allows us to recover the parameters of $\mathcal{D}$ if the bias vectors are linearly independent. However, if the vectors $v_1,\dots,v_k$ are not linearly independent, we can reduce to the independent case by working instead with $v_1^{\otimes m},\dots,v_k^{\otimes m}$ for sufficiently large $m$. The tensor power we require depends on the separation between the bias vectors:

Definition 4.2. We call a set of vectors $v_1,\dots,v_k$ $\eta$-separated if for every $i,j \in [k]$ such that $i \ne j$,

$$|\langle v_i, v_j\rangle| \le \|v_i\| \cdot \|v_j\| \cdot (1 - \eta).$$


Lemma 4.3. Suppose that $v_1,\dots,v_k \in \mathbb{R}^n$ are vectors which are $\eta$-separated, for $\eta > 0$. Let $m \ge \lceil \log_{\frac{1}{1-\eta}} k \rceil$. Then $v_1^{\otimes m},\dots,v_k^{\otimes m}$ are linearly independent.

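Proof. For vectors $u,w \in \mathbb{R}^n$ and for an integer $t \ge 0$, we have that $\langle u^{\otimes t}, w^{\otimes t}\rangle = \langle u,w\rangle^t$. If $v_1,\dots,v_k$ are $\eta$-separated, then for all $i \ne j$,

$$\left|\left\langle \frac{v_i^{\otimes m}}{\|v_i\|^m}, \frac{v_j^{\otimes m}}{\|v_j\|^m} \right\rangle\right| \le (1-\eta)^m \le \frac{1}{k}.$$

Now considering the Gram matrix of the vectors $v_i^{\otimes m}/\|v_i\|^m$, we have a $k \times k$ matrix with diagonal entries of value 1 and off-diagonal entries with maximum absolute value $\frac{1}{k}$. This matrix is strictly diagonally dominant, and thus full rank, so the vectors must be linearly independent. □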
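The argument of Lemma 4.3 can be checked numerically on a small example. The Python sketch below uses the identity $\langle u^{\otimes m}, w^{\otimes m}\rangle = \langle u,w\rangle^m$ to form the Gram matrix of the normalized tensor powers without ever building the tensors. The function names and the three example vectors (which are linearly dependent in $\mathbb{R}^2$) are hypothetical, and the code assumes the separation satisfies $0 < \eta < 1$.

```python
import math


def dot(u, w):
    return sum(a * b for a, b in zip(u, w))


def required_power(vectors):
    """Smallest integer m >= log_{1/(1-eta)} k, for the largest valid eta.

    Assumes 0 < eta < 1, i.e. some pair is correlated but none is parallel.
    """
    k = len(vectors)
    norms = [math.sqrt(dot(v, v)) for v in vectors]
    max_cos = max(
        abs(dot(vectors[i], vectors[j])) / (norms[i] * norms[j])
        for i in range(k) for j in range(k) if i != j
    )
    eta = 1.0 - max_cos
    return math.ceil(math.log(k) / math.log(1.0 / (1.0 - eta)))


def tensor_power_gram(vectors, m):
    """Gram matrix of v_i^{tensor m} / ||v_i||^m via <u,w>^m."""
    norms = [math.sqrt(dot(v, v)) for v in vectors]
    return [
        [(dot(u, w) / (nu * nw)) ** m for w, nw in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def strictly_diagonally_dominant(G):
    k = len(G)
    return all(
        abs(G[i][i]) > sum(abs(G[i][j]) for j in range(k) if j != i)
        for i in range(k)
    )


vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # dependent in R^2
m = required_power(vectors)
print(m, strictly_diagonally_dominant(tensor_power_gram(vectors, m)))
```

For these vectors the Gram matrix is not diagonally dominant at $m = 1$ but becomes so at the power returned by `required_power`.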


Remark 4.4. We reiterate here that in the case where $\eta = 0$, we can reduce our problem to one with fewer centers, and so our runtime is never infinite. Specifically, if $v_i = v_j$ for some $i \ne j$, then we can describe the same distribution by omitting $v_j$ and including $v_i$ with weight $w_i + w_j$. If $v_i = -v_j$, in the even moments we will see the center $v_i$ with weight $w_i + w_j$, and in the odd moments we will see $v_i$ with weight $w_i - w_j$. So we simply solve the problem by taking $m' = 2m$ for the first odd $m$ so that the $v_i^{\otimes m}$ are linearly independent, so that both the $2m'$- and $3m'$-order moments are even, to learn $w_i + w_j$ and $\pm v_i$, and then given the decomposition into centers we can extract $w_i$ and $w_j$ from the order-$m$ moments by solving a linear system.
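The final step of Remark 4.4 amounts to a two-by-two linear system. The following Python sketch is a minimal illustration under the assumption that the even-order moments have already yielded $s = w_i + w_j$ and the odd-order moments $d = w_i - w_j$ for a pair of centers with $v_i = -v_j$; the function names and numbers are hypothetical.

```python
def pair_entry(w_i, w_j, v, t):
    """Contribution of centers v and -v to an order-t diagonal-type entry."""
    return w_i * v ** t + w_j * (-v) ** t


def split_weights(even_weight, odd_weight):
    """Solve w_i + w_j = even_weight and w_i - w_j = odd_weight."""
    w_i = (even_weight + odd_weight) / 2
    w_j = (even_weight - odd_weight) / 2
    return w_i, w_j


# In even orders the pair looks like one center of weight w_i + w_j,
# in odd orders like one center of weight w_i - w_j.
print(pair_entry(0.3, 0.2, 0.5, 2), pair_entry(0.3, 0.2, 0.5, 3))
print(split_weights(0.5, 0.1))
```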

Thus, in the linearly dependent case, we may choose an appropriate power $m$, and instead apply the tensor completion algorithm to $\mathcal{M}_{2m}^{\mathcal{D}}$ and $\mathcal{M}_{3m}^{\mathcal{D}}$ to recover $\mathcal{T}_{2m}^{\mathcal{D}}$ and $\mathcal{T}_{3m}^{\mathcal{D}}$. We will then apply Theorem 4.1 to the vectors $v_1^{\otimes m},\dots,v_k^{\otimes m}$ in the same fashion.

Here we give the algorithm assuming perfect access to the moments of $\mathcal{D}$ and defer discussion of the finite-sample case to Appendix B.

^We remark again that the result in [AGH+14] is quite general, and applies to a large class of probability distributions of this character. However the work deals exclusively with distributions for which $\mathcal{M}_2 = \mathcal{T}_2$ and $\mathcal{M}_3 = \mathcal{T}_3$, and assumes access to $\mathcal{T}_2$ and $\mathcal{T}_3$ through moment estimation.

Algorithm 4.5 (Learning Mixtures of Product Distributions). Input: Moments of the distribution $\mathcal{D}$. Goal: Recover $v_1,\dots,v_k$ and $w_1,\dots,w_k$.

Let $m$ be the smallest odd integer such that $v_1^{\otimes m},\dots,v_k^{\otimes m}$ are linearly independent. Let $M = \mathcal{M}_{2m}^{\mathcal{D}} + E_2$ and $T = \mathcal{M}_{3m}^{\mathcal{D}} + E_3$ be approximations to the moment tensors of order $2m$ and $3m$.

1. Set the non-multilinear entries of $M$ and $T$ to "missing," and run Algorithm 3.2 on $M$ and $T$ to recover $M' = \sum_i w_i \cdot v_i^{\otimes 2m} + E_2'$ and $T' = \sum_i w_i \cdot v_i^{\otimes 3m} + E_3'$.

2. Flatten $M'$ to the $n^m \times n^m$ matrix $M = \sum_i w_i \cdot v_i^{\otimes m}(v_i^{\otimes m})^\top + E_2'$ and similarly flatten $T'$ to the $n^m \times n^m \times n^m$ tensor $T = \sum_i w_i \cdot (v_i^{\otimes m})^{\otimes 3} + E_3'$.

3. Run the "whitening" algorithm from Theorem 4.1 (see Appendix C) on $(M,T)$ to recover $w_1,\dots,w_k$ and $v_1^{\otimes m},\dots,v_k^{\otimes m}$.

4. Recover $v_1,\dots,v_k$ entry-by-entry, by taking the $m$th root of the corresponding entry in $v_1^{\otimes m},\dots,v_k^{\otimes m}$.

Output: $w_1,\dots,w_k$ and $v_1,\dots,v_k$.
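Step 4 of the algorithm can be sketched in a few lines of Python. The sketch assumes exact (noise-free) access to the single-index entries of $v^{\otimes m}$, which equal $v(j)^m$, and uses that $m$ is odd so that the real $m$th root preserves sign; the function names and the example vector are hypothetical.

```python
import math


def signed_root(value, m):
    """Real m-th root for odd m, preserving the sign of the input."""
    return math.copysign(abs(value) ** (1.0 / m), value)


def recover_from_diagonal(diagonal_entries, m):
    """Recover v entry-by-entry from the entries of v^{tensor m} at (j, ..., j)."""
    return [signed_root(t, m) for t in diagonal_entries]


m = 3
v = [0.5, -0.2, 0.9]
diagonal = [vj ** m for vj in v]  # single-index entries of the tensor power
recovered = recover_from_diagonal(diagonal, m)
print(recovered)
```

Choosing $m$ odd is what makes this step well-defined: for even $m$ the sign of each coordinate would be lost.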
Now Theorem 1.1 is a direct result of the correctness of Algorithm 4.5:

Proof of Theorem 1.1. The proof follows immediately by combining Theorem 4.1 and Theorem 3.4, and noting that the parameter $m$ is bounded by $m \le 2 + \log_{\frac{1}{1-\eta}} k$. □

A Tensor Completion with Noise

Here we will present a version of Theorem 3.4 which accounts for noise in the input to the algorithm.

We will first require a matrix completion algorithm which is robust to noise. The work of [HKZ11] provides us with such an algorithm; the following theorem is a consequence of their work.^

Theorem A.1. Let $M$ be an $m \times n$ rank-$r$ matrix which is $(\mu_U, \mu_V)$-incoherent, and let $\Omega \subseteq [m] \times [n]$
be the set of hidden indices. If there are at most $\kappa$ elements per column and $\rho$ elements per row of
Ll, and if 2( k ^ + pi^)r < 1, then let a = + P^)n and (3 = In particular, 

a < 1 and /3 < 1. Then for every (i > 0, there is a semidefinite program that computes outputs M 
satisfying 

,, X, . ^ 1 , 2(5a/ minfn, m) I I 

||M - Mil,. < 2,5 + 

We now give an analysis for the performance of our tensor completion algorithm, Algorithm 3.2, 
in the presence of noise in the input moments. This will enable us to use the algorithm on empirically 
estimated moments. 

Theorem A.2. Let $T^*$ be a symmetric tensor of order $m$, so that $T^* = \sum_{i\in[k]} w_i \cdot v_i^{\otimes m}$ for some vectors $v_1,\dots,v_k \in \mathbb{R}^n$ and scalars $w_1,\dots,w_k \ne 0$. Let $\mathrm{span}\{v_i\}$ have incoherence $\mu$ and dimension $r$. Suppose we are given access to $T = T^* + E$, where $E$ is a noise tensor with $|E(Y)| \le \epsilon$ for every $Y \in [n]^m$. If

$$4 \cdot k \cdot \mu \cdot m \le n,$$

then Algorithm 3.2 recovers a symmetric tensor $\hat{T}$ such that

$$\|\hat{T}(X,\cdot,\cdot) - T^*(X,\cdot,\cdot)\|_F \le 4 \cdot \epsilon \cdot (5n^{3/2})^{m-1},$$

for any slice T(A, •,•) indexed by a string X G in time In particular, the total 

Frobenius norm error \\T — T*||,. is bounded by A - e ■ (5n^/^) 2 ™“^. 

Proof. We proceed by induction on the histogram length of the entries: we will prove that an entry with a histogram of length $\ell$ has error at most $\epsilon \cdot (5n^{3/2})^{m-\ell}$.

In the base case of $\ell = m$, we have that by assumption, every entry of $E$ is bounded by $\epsilon$.

Now, for the inductive step, consider an entry $X$ with a histogram of length $\ell \le m - 1$. In filling in the entry $T(X)$, we only use information from entries with longer histograms, which by the inductive hypothesis each have error at most $\alpha = \epsilon \cdot (5n^{3/2})^{m-\ell-1}$. Summing over the squared errors of the individual entries, the squared Frobenius norm error of the known entries in the slice in which $T(X)$ was completed, pre-completion, is at most $n^2\alpha^2$. Due to the assumptions on $k, \mu, m, n$, by Theorem A.1, matrix completion amplifies a Frobenius norm error of $\beta$ to a Frobenius norm error of at most $5\beta \cdot n^{1/2}$. Thus, we have that the Frobenius norm error of the slice $T(X)$ was completed in, post-completion, is at most $5n^{3/2}\alpha$, and therefore that the error in the entry $T(X)$ is at most $\epsilon \cdot (5n^{3/2})^{m-\ell}$, as desired.

This concludes the induction. Finally, as our error bound is per entry, it is not increased by the symmetrization in step 4. Any slice has at most one entry with a histogram of length one, $2n - 2$ entries with a histogram of length two, and $n^2 - (2n - 1)$ entries with a histogram of length three. Thus the total error in a slice is at most $4 \cdot \epsilon \cdot (5n^{3/2})^{m-1}$, and there are $n^{m-2}$ slices. □
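The induction in this proof is a simple recursion, which the following Python sketch unrolls. It assumes, as read off from the factor $5n^{3/2}$ in the theorem statement, that one round of matrix completion turns a per-entry error $\alpha$ on an $n \times n$ slice (Frobenius error at most $n\alpha$) into an error of at most $5\sqrt{n} \cdot n\alpha$; the function name and the sample parameters are hypothetical.

```python
import math


def entry_error_bounds(n, m, eps):
    """Per-entry error bound for each histogram length, from m down to 1."""
    bounds = {m: eps}  # base case: given entries have error at most eps
    for ell in range(m - 1, 0, -1):
        alpha = bounds[ell + 1]               # error of the entries used
        slice_error = n * alpha               # sqrt(n^2 * alpha^2)
        bounds[ell] = 5 * math.sqrt(n) * slice_error
    return bounds


n, m, eps = 4, 3, 0.001
bounds = entry_error_bounds(n, m, eps)
closed_form = {ell: eps * (5 * n ** 1.5) ** (m - ell) for ell in range(1, m + 1)}
print(bounds, closed_form)
```

The recursion and the closed form $\epsilon \cdot (5n^{3/2})^{m-\ell}$ agree, which makes the exponential dependence of the error on the tensor order $m$ explicit.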

^In a previous version of this paper, we derive Theorem A.1 as a consequence of Theorem 2.4 and the work of [CP09]; we refer the interested reader to http://arxiv.org/abs/1506.03137v2 for the details.

B Empirical Moment Estimation for Learning Product Mixtures

In Section 4, we detailed our algorithm for learning mixtures of product distributions while assuming access to exact moments of the distribution $\mathcal{D}$. Here, we will give an analysis which accounts for the errors introduced by empirical moment estimation. We note that we made no effort to optimize the sample complexity, and that a tighter analysis of the error propagation may well be possible.

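Algorithm B.1 (Learning product mixture over separated centers). Input: $N$ independent samples $x_1,\dots,x_N$ from $\mathcal{D}$, where $\mathcal{D}$ has bias vectors with separation $\eta > 0$. Goal: Recover the bias vectors and mixing weights of $\mathcal{D}$.

Let $m$ be the smallest odd integer for which $v_1^{\otimes m},\dots,v_k^{\otimes m}$ become linearly independent.

1. Empirically estimate $\mathcal{M}_{2m}^{\mathcal{D}}$ and $\mathcal{M}_{3m}^{\mathcal{D}}$ by calculating $M := \frac{1}{N}\sum_{j\in[N]} x_j^{\otimes 2m}$ and $T := \frac{1}{N}\sum_{j\in[N]} x_j^{\otimes 3m}$.

2. Run Algorithm 4.5 on $M$ and $T$.

Output: The approximate mixing weights $\hat{w}_1,\dots,\hat{w}_k$, and the approximate vectors $\hat{v}_1,\dots,\hat{v}_k$.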


Theorem B.2 (Theorem 1.1 with empirical moment estimation). Let $\mathcal{D}$ be a product mixture over $k$ centers with bias vectors $v_1,\dots,v_k \in [-1,1]^n$ and mixing weights $w_1,\dots,w_k > 0$. Let $m$ be the smallest odd integer for which $v_1^{\otimes m},\dots,v_k^{\otimes m}$ are linearly independent (if $v_1,\dots,v_k$ are $\eta$-separated for $\eta > 0$, then $m \le \log_{\frac{1}{1-\eta}} k$). Define $M_{2m} = \sum_{i\in[k]} w_i \cdot v_i^{\otimes m}(v_i^{\otimes m})^\top$. Suppose

$$4 \cdot m \cdot r \cdot \mu \le n,$$

where $\mu$ and $r$ are the incoherence and dimension of the space $\mathrm{span}\{v_i\}$ respectively. Furthermore,
let /3 < min {0{l/ky/Wma.x), be suitably small, and let the parameter N in Algorithm B.l satisfy 
^ (4 log n -|- log |) for e satisfying 

l3-ak{M2m) . ( 1 akiMfl'^ \ 

- 4 . (5„3/2)3m-2 (5n3/2)3m/2 j 

Finally, pick any $p \in (0,1)$. Then with probability at least $1 - \delta - p$, Algorithm B.1 returns vectors $\hat{v}_1,\dots,\hat{v}_k$ and mixing weights $\hat{w}_1,\dots,\hat{w}_k$ such that

\\vi - Vi\\ < ^/n ■ (W ■ P + QO ■ f3 ■ \\M 2 m\\^^^ + , and \wi - Wi\ < AO/d, 

and runs in time ■ 0{N ■ poly(fc) log(l/? 7 ) • (log k + log log( "’™°'’‘ ))). In particular, a choice of 

N > gives sub-constant error, where the tilde hides the dependence on tCmin and crk{M 2 m)- 

Before proving Theorem B.2, we will state the guarantees of the whitening algorithm of [AGH+14] on noisy inputs, which is used as a black box in Algorithm 4.5. We have somewhat modified the statement in [AGH+14] for convenience; for a brief account of their algorithm, as well as an account of our modifications to the results as stated in [AGH+14], we refer the reader to Appendix C.

Theorem B.3 (Corollary of Theorem 4.3 in [AGH+14]). Let $v_1,\dots,v_k \in [-1,1]^n$ be vectors and let $w_1,\dots,w_k > 0$ be weights. Define $M = \sum_{i\in[k]} w_i \cdot v_i v_i^\top$ and $T = \sum_{i\in[k]} w_i \cdot v_i^{\otimes 3}$, and suppose we are given $\hat{M} = M + E_M$ and $\hat{T} = T + E_T$, where $E_M \in \mathbb{R}^{n \times n}$ and $E_T \in \mathbb{R}^{n \times n \times n}$ are symmetric error terms such that

2 0 ^WEMllFy/Wuiax V^II^tIIf ^ 

ak{M) ^ afc(M)3/2 \,/w^-kJ- 

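Then there is an algorithm that recovers vectors $\hat{v}_1,\dots,\hat{v}_k$ and weights $\hat{w}_1,\dots,\hat{w}_k$ such that for all $i \in [k]$,

$$\|\hat{v}_i - v_i\| \le \|E_M\|^{1/2} + 60\|M\|^{1/2}\beta + 10\beta, \quad\text{and}\quad |\hat{w}_i - w_i| \le 40\beta,$$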

with probability 1 — rj in time 0{L ■ ■ {log k + \og\og{-^j^^^))), where L is poly(/c) log(l/r/). 

Having stated the guarantees of the whitening algorithm, we are ready to prove Theorem B.2. 

Proof of Theorem B.2. We account for the noise amplification in each step. 

Step 1: In this step, we empirically estimate the multilinear moments of the distribution. We will apply concentration inequalities on each entry individually. By a Hoeffding bound, each entry concentrates within $\epsilon$ of its expectation with probability $1 - \exp(-\frac{1}{2}N \cdot \epsilon^2)$. Taking a union bound over the $\binom{n}{2m} + \binom{n}{3m}$ moments we must estimate, we conclude that with probability at least $1 - \exp(-\frac{1}{2}N\epsilon^2 + 4m\log n)$, all moments concentrate to within $\epsilon$ of their expectation. Setting $N = \frac{2}{\epsilon^2}(4m\log n + \log\frac{1}{\delta})$, we have that with probability $1 - \delta$, every entry concentrates to within $\epsilon$ of its expectation.
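The sample bound of Step 1 is easy to evaluate numerically. The Python sketch below assumes the constants as they arise from a Hoeffding bound for $\pm 1$-valued variables, namely a per-entry failure probability of $\exp(-N\epsilon^2/2)$ and at most $n^{4m}$ moments in the union bound; the function names and the sample parameters are hypothetical.

```python
import math


def samples_needed(n, m, eps, delta):
    """N = (2 / eps^2) * (4 m log n + log(1/delta)), rounded up."""
    return math.ceil((2.0 / eps ** 2) * (4 * m * math.log(n) + math.log(1.0 / delta)))


def failure_bound(n, m, eps, N):
    """Union bound exp(-N eps^2 / 2 + 4 m log n) on any entry deviating by eps."""
    return math.exp(-0.5 * N * eps ** 2 + 4 * m * math.log(n))


n, m, eps, delta = 10, 1, 0.1, 0.05
N = samples_needed(n, m, eps, delta)
print(N, failure_bound(n, m, eps, N))
```

With $N$ chosen this way the failure bound is at most $\delta$; note that the dependence on $\epsilon$ is quadratic, so the very small $\epsilon$ required by Theorem B.2 dominates the sample complexity.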

Now, we run Algorithm 4.5 on the estimated moments. 

Step 1 of Algorithm 4.5: Applying Theorem A. 2, we see that the error satisfies — 

4 • e • and < 4 • e • 

Step 2 of Algorithm 4.5: No error is introduced in this step. 

Step 3 of Algorithm 4.5: Here, we apply Theorem B.3 out of the box, where our vectors 
are now the vf^. The desired result now follows immediately for the estimated mixing weights, 
and for the estimated tensored vectors we have \\ui — vf'^\\ < 10 • /3 + 60 • /3||M||^/^ + ||£' 2 ||, for 
/3 as dehned in Theorem B.2. Note that \\E 2 \\ < ' ^k{^ 2 m )so let 7 = 

10-/3 + 60-/3||M||V2 + ^^^. 

Step 4 of Algorithm 4.5: Let $u_i^*$ be the restriction of $u_i$ to the single-index entries, and let $v_i^*$ be the same restriction for $v_i^{\otimes m}$. The bound on the error of the $u_i$ applies to restrictions, so we have $\|u_i^* - v_i^*\| \le \gamma$. So the error in each entry is bounded by $\gamma$. By the concavity of the $m$th root, we thus have that $\|\hat{v}_i - v_i\| \le \sqrt{n} \cdot \gamma^{1/m}$.

To see that choosing N > gives sub-constant error, calculations suffice; we only add that 

||Af 2 m|| < rn'”, where we have applied a bound on the Frobenius norm of ||M 2 m||. The tilde hides 
the dependence on $w_{\min}$ and $\sigma_k(M_{2m})$. This concludes the proof. □

C Recovering Distributions from Second- and Third-Order Tensors

In this appendix, we give an account of the algorithm of [AGH+14] which, given access to estimates of $M_{\mathcal{V}}$ and $T_{\mathcal{V}}$, can recover the parameters of $\mathcal{D}$. We note that the technique is very similar to those of [AHK12, HK13, AGHK14], but we use the particular algorithm of [AGH+14]. In previous sections, we have given a statement that follows from their results; here we will detail the connection.

In [AGH+14], the authors show that for a family of distributions with parameters $v_1,\dots,v_k \in \mathbb{R}^n$ and $w_1,\dots,w_k > 0$, if the $v_1,\dots,v_k$ are linearly independent and one has approximate access to $M_{\mathcal{V}} := \sum_{i\in[k]} w_i \cdot v_i v_i^\top$ and $T_{\mathcal{V}} := \sum_{i\in[k]} w_i \cdot v_i^{\otimes 3}$, then the parameters can be recovered. For this, they use two algorithmic primitives: singular value decompositions and tensor power iteration.

Tensor power iteration is a generalization of the power iteration technique for finding matrix eigenvectors to the tensor setting (see e.g. [AGH+14]). The generalization is not complete, and the convergence criteria for the method are quite delicate and not completely understood, although there has been much progress in this area of late ([AGJ14b, AGJ14a, GHJY15]). However, it is well-known that when the input tensor $T \in \mathbb{R}^{n \times n \times n}$ is decomposable into $k \le n$ symmetric orthogonal rank-1 tensors, i.e. $T = \sum_{i\in[k]} v_i^{\otimes 3}$ where $k \le n$ and $\langle v_i, v_j\rangle = 0$ for $i \ne j$, then it is possible to recover $v_1,\dots,v_k$ using tensor power iteration.
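To make the primitive concrete, here is a minimal Python sketch of a single run of tensor power iteration on an exactly orthogonally decomposable tensor. It is not the robust procedure of [AGH+14] (there are no random restarts and no deflation); the tensor is written as $\sum_i \lambda_i u_i^{\otimes 3}$ with unit vectors $u_i$, and the function names, starting point, and example components are hypothetical.

```python
import math


def rank_one_sum(lams, vecs):
    """Build sum_i lam_i * u_i (x) u_i (x) u_i as a nested list."""
    n = len(vecs[0])
    return [[[sum(l * u[a] * u[b] * u[c] for l, u in zip(lams, vecs))
              for c in range(n)] for b in range(n)] for a in range(n)]


def power_iteration(T, theta, iters=30):
    """Repeat theta <- T(I, theta, theta) / ||T(I, theta, theta)||."""
    n = len(T)
    for _ in range(iters):
        nxt = [sum(T[a][b][c] * theta[b] * theta[c]
                   for b in range(n) for c in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        theta = [x / norm for x in nxt]
    # The eigenvalue estimate is T(theta, theta, theta).
    lam = sum(T[a][b][c] * theta[a] * theta[b] * theta[c]
              for a in range(n) for b in range(n) for c in range(n))
    return lam, theta


lams = [2.0, 1.0]
vecs = [[0.6, 0.8], [0.8, -0.6]]  # orthonormal components
T = rank_one_sum(lams, vecs)
lam, theta = power_iteration(T, [1.0, 0.0])
print(lam, theta)
```

From the starting point $(1, 0)$ the iteration converges to the component $u_i$ maximizing $\lambda_i |\langle u_i, \theta\rangle|$, here the first one; recovering the remaining components would require deflating $T$ and repeating.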

The authors of [AGH+14] prove that this process is robust to some noising of $T$:

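Theorem C.1 (Theorem 5.1 in [AGH+14]). Let $\hat{T} = T + E \in \mathbb{R}^{k \times k \times k}$ be a symmetric tensor, where $T$ has the decomposition $T = \sum_{i\in[k]} \lambda_i \cdot u_i \otimes u_i \otimes u_i$ for orthonormal vectors $u_1,\dots,u_k$ and $\lambda_1,\dots,\lambda_k > 0$, and $E$ is a tensor such that $\|E\|_F \le \beta$. Then there exist universal constants $C_1, C_2, C_3 > 0$ such that the following holds. Choose $\eta \in (0,1)$, and suppose

$$\beta \le C_1 \cdot \frac{\lambda_{\min}}{k}$$

and also

$$\sqrt{\frac{\ln(L/\log_2(k/\eta))}{\ln k}} \cdot \left(1 - \frac{\ln(\ln(L/\log_2(k/\eta))) + C_3}{4\ln(L/\log_2(k/\eta))} - \sqrt{\frac{\ln 8}{\ln(L/\log_2(k/\eta))}}\right) \ge 1.02\left(1 + \sqrt{\frac{\ln 4}{\ln k}}\right).$$

Then there is a tensor power iteration based algorithm that recovers vectors $\hat{u}_1,\dots,\hat{u}_k$ and coefficients $\hat{\lambda}_1,\dots,\hat{\lambda}_k$ with probability at least $1 - \eta$ such that for all $i \in [k]$,

$$\|\hat{u}_i - u_i\| \le \frac{8\beta}{\lambda_i} \quad\text{and}\quad |\hat{\lambda}_i - \lambda_i| \le 5\beta,$$

in $O(L \cdot k^3 \cdot (\log k + \log\log(\frac{\lambda_{\max}}{\beta})))$ time. The conditions are met when $L = \mathrm{poly}(k)\log(1/\eta)$.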

The idea is then to take the matrix $M_{\mathcal{V}}$, and apply a whitening map $W = (M_{\mathcal{V}}^{\dagger})^{1/2}$ to orthogonalize the vectors. Because $v_1,\dots,v_k$ are assumed to be linearly independent, and because $W M_{\mathcal{V}} W^\top = \sum_{i\in[k]} w_i (W v_i)(W v_i)^\top = \mathrm{Id}_k$, it follows that the $\sqrt{w_i} \cdot W v_i$ are orthogonal vectors.

Now, applying the map $W \in \mathbb{R}^{k \times n}$ to every slice of $T$ in every direction, we obtain a new tensor $T_W = \sum_{i\in[k]} w_i (W v_i)^{\otimes 3}$, by computing each entry:

$$T(W,W,W)_{a,b,c} := T_W(a,b,c) = \sum_{1 \le a',b',c' \le n} W^\top(a',a) \cdot W^\top(b',b) \cdot W^\top(c',c) \cdot T(a',b',c').$$

From here on out we will use $T(A,A,A)$ to denote this operation on tensors. The tensor $T_W$ thus has an orthogonal decomposition. Letting $u_i = \sqrt{w_i}\,W v_i$, we have that $T_W = \sum_{i\in[k]} \frac{1}{\sqrt{w_i}} \cdot u_i^{\otimes 3}$. Applying tensor power iteration allows the recovery of the $u_i = \sqrt{w_i} \cdot W v_i$ and the weights $\frac{1}{\sqrt{w_i}}$, from which the $v_i$ are recoverable.
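The operation $T(W,W,W)$ can be written out directly from its entrywise definition. The Python sketch below is a naive implementation for a $k \times n$ matrix $W$ given as a list of rows (so that $W^\top(a',a)$ is `W[a][a_prime]`); the function names and the small example are hypothetical, and the example checks the identity $(v^{\otimes 3})(W,W,W) = (Wv)^{\otimes 3}$ that underlies the decomposition of $T_W$.

```python
def multilinear_transform(T, W):
    """Return T(W, W, W) for an n x n x n tensor T and a k x n matrix W."""
    k, n = len(W), len(W[0])
    return [[[sum(W[a][i] * W[b][j] * W[c][l] * T[i][j][l]
                  for i in range(n) for j in range(n) for l in range(n))
              for c in range(k)] for b in range(k)] for a in range(k)]


def cube(v):
    """The rank-one tensor v (x) v (x) v."""
    return [[[x * y * z for z in v] for y in v] for x in v]


v = [1, 2, 3]
W = [[1, 0, 1], [0, 1, -1]]
Wv = [sum(row[j] * v[j] for j in range(len(v))) for row in W]
print(multilinear_transform(cube(v), W) == cube(Wv))
```

This direct evaluation costs $O(k^3 n^3)$ operations; applying $W$ one mode at a time would be cheaper, but the entrywise form matches the definition used in the text.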

Theorem B.3 is actually a consequence of Theorem C.1 and the following proposition, which controls the error propagation in the whitening step.

Proposition C.2 (Consequence of Lemma 12 of [HK13]). Let $M_2 = \sum_{i\in[k]} \lambda_i \cdot u_i u_i^\top$ be a rank-$k$ PSD matrix, and let $\hat{M}$ be a symmetric matrix whose top $k$ eigenvalues are positive. Let $T = \sum_{i\in[k]} \lambda_i \cdot u_i^{\otimes 3}$, and let $\hat{T} = T + E_T$ where $E_T$ is a symmetric tensor with $\|E_T\|_F \le \gamma$.

Suppose $\|M_2 - \hat{M}\|_F \le \epsilon\sigma_k(M_2)$, where $\sigma_k(M_2)$ is the $k$th eigenvalue of $M_2$. Let $U$ be the square root of the pseudoinverse of $M_2$, and let $\hat{U}$ be the square root of the pseudoinverse of the projection of $\hat{M}$ to its top $k$ eigenvectors. Then

\\T{U, U, U) - nil, U, i/)|| < -^e + 7 • \\Uf\\U\\F 

V '^min 

Proof. We use the following fact, which is given as Lemma 12 in [HK13]. 

\\T{U,U,U)-f{U,U,U)\\ < -^=e+\\ET{U,U,U)\\2. 

V -^min 

The proof of this fact is straightforward, but requires a great deal of bookkeeping; we refer the 
reader to [HK13]. 

It remains to bound $\|E_T(\hat{U},\hat{U},\hat{U})\|_2$. Some straightforward calculations yield the desired bound,

\\E{U,U,U)h<Yi \\{Uei) ® U^EiUW < WUeihWU^EiUW 

i i 

< \\Uf • WUeMEiW < \\Uf • ^ WUcihWEiWF 

i i 

< \\Uf • < \\Uf • ||C>||f • Ilii^IlF, 

where we have applied the triangle inequality, the behavior of the spectral norm under tensoring, 
the submultiplicativity of the norm, and Cauchy-Schwarz. □ 


We now prove Theorem B.3. 


Proof of Theorem B.3. Let $\hat{U}$ be the square root of the projection of $\hat{M}$ to its top $k$ eigenvectors. Note that $\|\hat{U}\| \le \sigma_k(M_2)^{-1/2}$, $\|\hat{U}\|_F \le \sqrt{k}\,\sigma_k(M_2)^{-1/2}$, and thus by Proposition C.2, the error $E$ in Theorem C.1 satisfies

2« •= II F|| < ^W^mWf \\ET\\FVk 

afc(M2)V& ^fc(M2)3/2' 

Suppose 1/40 > 2/3 > ||L/||f- Applying Proposition C.2, we obtain vectors ui,...,Uk and 
scaling factors Ai,..., A^ such that ||uj — y/wl ■ M~^/‘^Vi\\ < 16 • /3 • y/wl and — Ai| < 5 • /3. The 
Wi are now recovered by taking the inverse square of the A^, so we have that when 10/3 < | < |Ai, 


Wi - Wi 


A? 


< 


1 


1 

(Ai ± 10/3)2 


< 5/3 • 


2Xi - 10/3 
X]{Xi - 10/3)2 


<40/3, 
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where to obtain the second inequality we have taken a Taylor expansion, and in the final inequality 
we have used the fact that 10/3 < \\i. 

We now recover Vi by taking Vi = Xi ■ Uui, so we have 

\\vi - Vi\\ < ||Aj • U^/lFi- - Vi\\ + ||Ai • U{ui - 

< {Xi ■ y/Wi)\\{U ■ - I)vi\\ + (1 - Xiy/uH)\\vi\\ + || 17 || • 16l3Xiy/uri 

< (1 + 10/3)||f7 • M-^/2 _ J|| ^ ^ ^ 

< (1 + 10/3)||f7 • M-^/2 _ J|| ^ ^ ^ 

It now suffices to bound ~ -^11) for which it in turn suffices to bound ||M — 

/||, since the eigenvalues of AA^ are the square eigenvalues of A. Consider ||(M“^/^nfc(M + 
EM)^k)Al~^A — /||^ where 11^ is the projector to the top k eigenvectors of M. Because both 
matrices are PSD, finally this reduces to bounding \\M — Ilk{M + EM)'^k\\- Since M is rank k, we 
have that ||M - Uk{M + Pljllfcll = o-fc+i(-EM) < ||^m||- 
Thus, taking loose bounds, we have 

$$\|\hat{v}_i - v_i\| \le \|E_M\|^{1/2} + 60\beta \cdot \|M_2\|^{1/2} + 10\beta,$$

as desired. □